The Keras API is a wrapper around TensorFlow for building neural network models.

When the training dataset is small enough to be loaded into memory as tensors, a TensorFlow model (built with the Keras API) can be trained directly on those tensors with the .fit() method.
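As a minimal sketch of this in-memory case, the snippet below trains a one-layer Keras model directly on two small tensors. The toy data, the layer size, and the training settings are assumptions made for illustration and are not taken from these notes.

```python
import tensorflow as tf

tf.random.set_seed(1)

# Toy data (assumed): 8 samples with 3 features each and one regression target.
x = tf.random.uniform([8, 3], dtype=tf.float32)
y = tf.reduce_sum(x, axis=1, keepdims=True)

# A single dense layer is enough to show the call pattern.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(3,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="sgd", loss="mse")

# The tensors are passed to fit() directly; Keras slices them into batches of 4.
history = model.fit(x, y, epochs=2, batch_size=4, verbose=0)
print(history.history["loss"])
```

The history object holds one loss value per epoch, so two values are printed here.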

In general, when the dataset is larger than the computer's memory, the data must be loaded from a storage device (hard drive, SSD) in chunks, one batch at a time (mini-batches).
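The idea of loading data in chunks can be illustrated without TensorFlow. The following pure-Python sketch reads a text file lazily and yields one mini-batch at a time. The helper read_in_batches and the temporary file are hypothetical names used only for this example; they are not part of any library.

```python
import itertools
import os
import tempfile


def read_in_batches(path, batch_size):
    """Yield lists of at most batch_size floats, reading the file lazily."""
    with open(path) as f:
        values = (float(line) for line in f)  # generator: one line at a time
        while True:
            batch = list(itertools.islice(values, batch_size))
            if not batch:
                return
            yield batch


# Write a tiny stand-in dataset to disk (hypothetical file for illustration).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("\n".join(str(v) for v in [1.2, 3.4, 7.5, 4.1, 5.0, 1.0, 2.5]))
    data_path = tmp.name

batches = list(read_in_batches(data_path, batch_size=3))
os.remove(data_path)
print(batches)
```

Only one batch is held in memory at a time inside the generator; the final batch is shorter because 7 values do not divide evenly by 3.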

TensorFlow Dataset

import tensorflow as tf

a = [1.2, 3.4, 7.5, 4.1, 5.0, 1.0]

# Each element of the list becomes one element of the dataset
ds = tf.data.Dataset.from_tensor_slices(a)
print(ds)

for item in ds:
    print(item)

tf.Tensor(1.2, shape=(), dtype=float32)

tf.Tensor(3.4, shape=(), dtype=float32)

tf.Tensor(7.5, shape=(), dtype=float32)

tf.Tensor(4.1, shape=(), dtype=float32)

tf.Tensor(5.0, shape=(), dtype=float32)

tf.Tensor(1.0, shape=(), dtype=float32)

Dataset with a batch size of 3

ds_batch = ds.batch(3)

for i, elem in enumerate(ds_batch, 1):
    print('batch {}:'.format(i), elem.numpy())

batch 1: [1.2 3.4 7.5]

batch 2: [4.1 5. 1. ]

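When the number of elements is not evenly divisible by the batch size, the drop_remainder parameter of batch() controls whether the final, smaller batch is dropped (default: False, which keeps it).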
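To make the effect of drop_remainder concrete, here is a small sketch that batches the same six values with a batch size of 4, which does not divide the data evenly. The batch size of 4 is chosen only for this illustration.

```python
import tensorflow as tf

ds = tf.data.Dataset.from_tensor_slices([1.2, 3.4, 7.5, 4.1, 5.0, 1.0])

# 6 elements with batch size 4: the last batch would hold only 2 elements.
kept = [b.numpy().tolist() for b in ds.batch(4)]
dropped = [b.numpy().tolist() for b in ds.batch(4, drop_remainder=True)]

print('drop_remainder=False:', [len(b) for b in kept])
print('drop_remainder=True: ', [len(b) for b in dropped])
```

With the default setting the short final batch is kept, so the batch sizes are 4 and 2; with drop_remainder=True only the full batch of 4 remains.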

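Joining two tensors into one dataset

When a feature tensor and a label tensor are combined into a single dataset, the corresponding elements of the two tensors can be retrieved together as tuples.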

tf.random.set_seed(1)

t_x = tf.random.uniform([4, 3], dtype=tf.float32)  # features: 4 samples x 3 values
t_y = tf.range(4)                                   # labels: 0..3

ds_x = tf.data.Dataset.from_tensor_slices(t_x)
ds_y = tf.data.Dataset.from_tensor_slices(t_y)

# zip pairs the i-th element of ds_x with the i-th element of ds_y
ds_joint = tf.data.Dataset.zip((ds_x, ds_y))

for example in ds_joint:
    print(' x:', example[0].numpy(), ' y:', example[1].numpy())

x: [0.165 0.901 0.631] y: 0

x: [0.435 0.292 0.643] y: 1

x: [0.976 0.435 0.66 ] y: 2

x: [0.605 0.637 0.614] y: 3

The same joint dataset can be built in a single step by passing both tensors to tf.data.Dataset.from_tensor_slices(), as follows.

ds_joint = tf.data.Dataset.from_tensor_slices((t_x, t_y))

for example in ds_joint:
    print(' x:', example[0].numpy(), ' y:', example[1].numpy())

x: [0.165 0.901 0.631] y: 0

x: [0.435 0.292 0.643] y: 1

x: [0.976 0.435 0.66 ] y: 2

x: [0.605 0.637 0.614] y: 3

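Because each feature vector (x) and its label (y) are combined into a single element, shuffling the dataset never breaks the correspondence between them.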
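The following sketch checks this property on a joint dataset. The feature values are hand-made (an assumption for illustration) so that row i always starts with 3*i, which makes it easy to see that every row keeps its own label after shuffling.

```python
import tensorflow as tf

# Row i of t_x starts with the value 3*i, so each row can be matched to its label.
t_x = tf.reshape(tf.range(12, dtype=tf.float32), [4, 3])
t_y = tf.range(4)

ds = tf.data.Dataset.from_tensor_slices((t_x, t_y)).shuffle(buffer_size=4)

pairs = [(x.numpy(), int(y.numpy())) for x, y in ds]
for x, y in pairs:
    print(' x:', x, ' y:', y)
```

The order of the printed rows varies from run to run, but each x row always appears next to the label it was created with.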

Transforming a dataset

# Apply the function to every (x, y) pair; the label is passed through unchanged
ds_trans = ds_joint.map(lambda x, y: (x*2-1.0, y))

for example in ds_trans:
    print(' x:', example[0].numpy(), ' y:', example[1].numpy())

x: [-0.67 0.803 0.262] y: 0

x: [-0.131 -0.416 0.285] y: 1

x: [ 0.952 -0.13 0.32 ] y: 2

x: [ 0.21 0.273 0.229] y: 3

This rescales the t_x values from the original range [0, 1) to [-1, 1).
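As a quick check of the mapping x*2-1, the sketch below applies it through map() to a few hand-picked feature values. The named function rescale and the sample values are assumptions introduced for this example.

```python
import tensorflow as tf


def rescale(x, y):
    """Map features from [0, 1) to [-1, 1); labels pass through unchanged."""
    return x * 2.0 - 1.0, y


# Hand-picked feature values (assumed) make the mapping easy to read.
features = tf.constant([[0.0, 0.25, 0.5], [0.75, 0.5, 0.0]], dtype=tf.float32)
labels = tf.constant([0, 1])

ds_demo = tf.data.Dataset.from_tensor_slices((features, labels)).map(rescale)
rescaled = [x.numpy().tolist() for x, _ in ds_demo]
print(rescaled)
```

The lower end 0.0 maps to -1.0 and the midpoint 0.5 maps to 0.0, so values in [0, 1) end up in [-1, 1).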

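If AutoGraph prints a conversion warning for the lambda passed to map(), the function can be wrapped with tf.autograph.experimental.do_not_convert.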

shuffle(), batch(), repeat()

tf.random.set_seed(1)

ds = ds_joint.shuffle(buffer_size=len(t_x))

for example in ds:
    print(' x:', example[0].numpy(), ' y:', example[1].numpy())

x: [0.9757855 0.43509948 0.6601019 ] y: 2

x: [0.4345461 0.29193902 0.64250207] y: 1

x: [0.16513085 0.9014813 0.6309742 ] y: 0

x: [0.60489583 0.6366315 0.6144488 ] y: 3

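The buffer_size parameter sets how many elements the shuffle buffer holds. Each output element is drawn at random from the buffer, and the vacated slot is refilled with the next element of the source dataset.

Therefore, if buffer_size is small, the dataset may not be shuffled thoroughly.

To shuffle completely in every epoch, set buffer_size to the number of samples, for example buffer_size=len(t_x).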
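The buffer mechanism described above can be modeled in a few lines of pure Python. This is only a sketch of the idea, not how tf.data is implemented; the function buffered_shuffle and its seed parameter are hypothetical.

```python
import random


def buffered_shuffle(items, buffer_size, seed=0):
    """Shuffle items using a fixed-size buffer, in the style described above."""
    rng = random.Random(seed)
    it = iter(items)
    buffer = []

    # Fill the buffer with the first buffer_size elements.
    for item in it:
        buffer.append(item)
        if len(buffer) == buffer_size:
            break

    out = []
    # Emit a random buffered element, then refill its slot from the source.
    for item in it:
        idx = rng.randrange(len(buffer))
        out.append(buffer[idx])
        buffer[idx] = item

    # Source exhausted: emit what is left in the buffer in random order.
    rng.shuffle(buffer)
    out.extend(buffer)
    return out


unshuffled = buffered_shuffle(range(10), buffer_size=1)
partly = buffered_shuffle(range(10), buffer_size=3)
print(unshuffled)
print(partly)
```

With a buffer of size 1 the order is unchanged, and with a buffer of size 3 an element can move forward by at most two positions, which is why a small buffer_size gives only a local shuffle.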

ds = ds_joint.batch(batch_size=3, drop_remainder=False)

# Take only the first batch
batch_x, batch_y = next(iter(ds))
print('Batch x:\n', batch_x.numpy())
print('Batch y:\n', batch_y.numpy())

Batch x:

[[0.16513085 0.9014813 0.6309742 ]

[0.4345461 0.29193902 0.64250207]

[0.9757855 0.43509948 0.6601019 ]]

Batch y:

[0 1 2]

Repeating the batched dataset twice

ds = ds_joint.batch(3).repeat(count=2)

for i, (batch_x, batch_y) in enumerate(ds):
    print(i, batch_x.shape, batch_y.numpy())

0 (3, 3) [0 1 2]

1 (1, 3) [3]

2 (3, 3) [0 1 2]

3 (1, 3) [3]

ds = ds_joint.repeat(count=2).batch(3)

for i, (batch_x, batch_y) in enumerate(ds):
    print(i, batch_x.shape, batch_y.numpy())

0 (3, 3) [0 1 2]

1 (3, 3) [3 0 1]

2 (2, 3) [2 3]

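Batching first and then repeating produces four batches, because each pass over the 4 samples ends with a partial batch of 1.

Repeating first and then batching produces three batches, because the 8 repeated samples are batched as one stream and a batch can span the boundary between passes.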
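The two batch counts follow from simple arithmetic, sketched below in plain Python. The helper names are hypothetical, and the formulas assume drop_remainder=False so that partial batches are kept.

```python
import math


def batches_batch_then_repeat(n, batch_size, count):
    """Each pass is batched separately, then the batches are repeated."""
    return math.ceil(n / batch_size) * count


def batches_repeat_then_batch(n, batch_size, count):
    """All n * count samples are batched as one continuous stream."""
    return math.ceil(n * count / batch_size)


# 4 samples, batch size 3, 2 repetitions, as in the example above.
n_batch_first = batches_batch_then_repeat(4, 3, 2)
n_repeat_first = batches_repeat_then_batch(4, 3, 2)
print(n_batch_first, n_repeat_first)
```

For 4 samples, a batch size of 3, and 2 repetitions this gives 4 and 3, matching the outputs shown above.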

# Order 1: shuffle -> batch -> repeat
tf.random.set_seed(1)

ds = ds_joint.shuffle(4).batch(2).repeat(3)

for i, (batch_x, batch_y) in enumerate(ds):
    print(i, batch_x.shape, batch_y.numpy())

0 (2, 3) [2 1]

1 (2, 3) [0 3]

2 (2, 3) [0 3]

3 (2, 3) [1 2]

4 (2, 3) [3 0]

5 (2, 3) [1 2]

# Order 2: batch -> shuffle -> repeat
tf.random.set_seed(1)

ds = ds_joint.batch(2).shuffle(4).repeat(3)

for i, (batch_x, batch_y) in enumerate(ds):
    print(i, batch_x.shape, batch_y.numpy())

0 (2, 3) [0 1]

1 (2, 3) [2 3]

2 (2, 3) [0 1]

3 (2, 3) [2 3]

4 (2, 3) [2 3]

5 (2, 3) [0 1]

# Order 3: batch -> repeat -> shuffle
tf.random.set_seed(1)

ds = ds_joint.batch(2).repeat(3).shuffle(4)

for i, (batch_x, batch_y) in enumerate(ds):
    print(i, batch_x.shape, batch_y.numpy())

0 (2, 3) [0 1]

1 (2, 3) [0 1]

2 (2, 3) [2 3]

3 (2, 3) [2 3]

4 (2, 3) [0 1]

5 (2, 3) [2 3]

When batch() is called twice on a dataset, each element that comes out is a batch of batches.
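A short sketch of the nested result: calling batch(2) twice on a joint dataset of 4 samples with 3 features adds one extra leading dimension. The dataset is rebuilt here so the snippet runs on its own; the batch sizes are chosen only for illustration.

```python
import tensorflow as tf

t_x = tf.random.uniform([4, 3], dtype=tf.float32)
t_y = tf.range(4)
ds_joint = tf.data.Dataset.from_tensor_slices((t_x, t_y))

# First batch(2): elements of shape (2, 3). Second batch(2): groups of those.
ds_nested = ds_joint.batch(2).batch(2)

batch_x, batch_y = next(iter(ds_nested))
print(batch_x.shape, batch_y.shape)
```

The features come out with shape (2, 2, 3) and the labels with shape (2, 2), that is, two batches of two samples each.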

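In order 2, batch(2) first produced a dataset of two batches, and shuffle(4) then shuffled those batches with a buffer of size 4.

Because shuffle() operates on whole batches in this order, only the order of the batches can change; the samples inside each batch are never mixed, however large the buffer is.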
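A small check of the batch-then-shuffle order: the sketch below collects every distinct batch of labels seen over several repetitions. Only the labels are used, and the repetition count of 5 is an arbitrary choice for this example.

```python
import tensorflow as tf

t_y = tf.range(4)
ds_labels = tf.data.Dataset.from_tensor_slices(t_y)

# batch -> shuffle: the shuffle buffer holds whole batches, not single samples.
seen = set()
for batch in ds_labels.batch(2).shuffle(4).repeat(5):
    seen.add(tuple(batch.numpy().tolist()))

print(sorted(seen))
```

Only the batches (0, 1) and (2, 3) ever appear, in varying order. To mix samples across batches, shuffle() has to come before batch(), as in order 1.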

Build Dataset(preprocess, transform)